Replace or Retrieve Keywords In Documents at Scale
Abstract
In this paper we introduce the FlashText [1] algorithm for finding or replacing keywords in a given text. FlashText can search for or replace keywords in a single pass over a document. Its time complexity does not depend on the number of terms being searched or replaced: for a document of N characters and a dictionary of M keywords, the time complexity is O(N). This is much faster than regex (see Figures 1 and 2), whose time complexity is O(M × N). FlashText also differs from the Aho-Corasick algorithm [3] in that it does not match substrings. It is designed to match only complete words (words with boundary characters [2] on both sides). For an input dictionary of {Apple}, the algorithm will not match ‘I like Pineapple’. It is also designed to take the longest match first: for an input dictionary {Machine, Learning, Machine learning} and the string ‘I like Machine learning’, it considers only the longest match, which is Machine learning. We have made a Python implementation of this algorithm available as open source on GitHub [1], released under the permissive MIT License.
Subjects: Data Structures and Algorithms (cs.DS)
Keywords: Information retrieval, Keyword Search, Regex, Keyword Replace, FlashText
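The whole-word and longest-match behaviour described in the abstract can be illustrated with a small character trie. The following is a minimal sketch and not the published FlashText implementation. The names `build_trie`, `extract_keywords`, `WORD_CHARS` and `_END` are hypothetical, and the choice of word characters (ASCII letters, digits, underscore) and the case-insensitive matching are assumptions made for this example. The work done at each start position is bounded by the length of the longest keyword rather than by the number of keywords in the dictionary.

```python
import string

# Assumption: ASCII letters, digits and underscore are word characters;
# every other character is treated as a word boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")
_END = "_end_"  # hypothetical sentinel key; cannot collide with 1-char keys


def build_trie(keywords):
    """Build a character trie (matching is case-insensitive in this sketch)."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw.lower():
            node = node.setdefault(ch, {})
        node[_END] = kw  # remember the keyword as originally written
    return root


def extract_keywords(text, root):
    """Return whole-word keyword matches, left to right, longest match first."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # A match may only begin at the start of a word.
        if text[i] not in WORD_CHARS or (i > 0 and text[i - 1] in WORD_CHARS):
            i += 1
            continue
        node, j = root, i
        best_kw, best_end = None, i
        # Walk the trie as far as the text allows.
        while j < n and text[j].lower() in node:
            node = node[text[j].lower()]
            j += 1
            # Accept a keyword only if it ends at a word boundary;
            # later (longer) acceptances overwrite earlier ones.
            if _END in node and (j == n or text[j] not in WORD_CHARS):
                best_kw, best_end = node[_END], j
        if best_kw is not None:
            found.append(best_kw)
            i = best_end
        else:
            # No keyword starts here: skip the rest of the current word.
            while i < n and text[i] in WORD_CHARS:
                i += 1
    return found


trie = build_trie(["Machine", "Learning", "Machine learning", "Apple"])
print(extract_keywords("I like Machine learning", trie))  # ['Machine learning']
print(extract_keywords("I like Pineapple", trie))         # []
```

Because lookups follow one trie edge per character, adding more keywords grows the trie but does not add extra passes over the text, which is the intuition behind the O(N) claim above.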
Similar resources
Search Result Clustering Method at NTCIR-5 Web Query Expansion Subtask
We use a retrieval system with search result clustering to tackle the NTCIR-5 WEB Query Term Expansion Subtask. The system clusters the search results in a way that makes it easier for the user to select relevant documents as feedback documents. In addition, we select phrase words or named entities (NE) as query-expansion keywords from the feedback documents because these words tend to repr...
Semantic Retrieval System Based on Ontology
The recall factor is low when keywords are used to retrieve information, and many related documents are omitted. Semantic annotation is used to annotate documents to improve the recall factor. However, queries over extremely large numbers of instances may crash the ABox reasoner. In this research, a method is proposed to improve the efficiency of semantic retrieval by combining ABox reasoning and datab...
Semantic Based Information Extraction from Web
Extraction of information from the web is a challenging task. The information stored on the web may be structured or unstructured. Structured information provides enhanced knowledge, which helps to retrieve relevant documents and helps the user to understand a particular domain. This paper explores the importance of information extraction using semantics. It enables the users to discover...
Common Errors in the English Keywords of Articles in the Field of Medical Sciences Education
Background and purpose: Author-assigned keywords at the end of the abstracts in scientific articles are the words most relevant to the content of the article. They are the main sources for indexing and storing the articles in databases, and help to retrieve related articles. Therefore, any mistake or ambiguity in keywords leads to disruption of both data storage and retrieval processes. This stu...
A system for retrieving broadcast news speech documents using voice input keywords and similarity between words
This paper describes a robust speech document retrieval system that uses voice input keywords. To solve the inevitable problems that arise when the input to the system is speech, i.e. misrecognition, a novel method was developed in which, before retrieval processing, unproductive keyword candidates are discarded by a grouping process using the similarity between words and the recognition...
Journal: CoRR
Volume: abs/1711.00046
Publication date: 2017